pacman::p_load(jsonlite, tidygraph, ggraph, igraph, lsa,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, plotly, naniar, tm, topicmodels, ldatuning)
Huynh Minh Phuong
FishEye International, a non-profit organization dedicated to combatting the scourge of illegal, unreported and unregulated (IUU) fishing, has been granted access to fishing-related companies’ financial database, offered by an international finance corporation. Through their previous investigations, FishEye has discovered that companies with unusual arrangements are more likely to be engaged in IUU activities or other questionable practices. To leverage this valuable resource, FishEye has transformed the database into a comprehensive knowledge graph, which encompasses data on companies, owners, workers and revenue.
The primary objective of our project is to employ this graph to detect irregularities that may indicate a company’s involvement in IUU fishing.
The dataset contains the knowledge graph, with 27,622 nodes and 24,038 edges:
Node Attributes:
type – Type of node as defined above.
country – Country associated with the entity. This can be a full country or a two-letter country code.
product_services – Description of product services that the “id” node does.
revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.
id – Identifier of the node, which is also the name of the entity.
role – A subset of the node “type”; not present on every node.
Edge Attributes:
type – Type of the edge as defined above.
source – ID of the source node.
target – ID of the target node.
dataset – Always “MC3”.
role – A subset of the edge “type”; not present on every edge.
We first extract the links data into a tibble using as_tibble().
Rows: 24,038
Columns: 3
$ source <list> "Lake Chad Catchers Limited Liability Company Worldwide", "La…
$ target <list> "Erin Flores", "Linda Lee", "Sharon Coleman", "John Rivera", "…
$ type <list> "Beneficial Owner", "Beneficial Owner", "Beneficial Owner", "B…
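As a minimal sketch of this extraction step, the snippet below parses a tiny inline JSON string instead of the real challenge file. The JSON content and the object names `mc3_json`, `mc3_data` and `mc3_links` are assumptions made for illustration; only the `links` element and its `source`, `target` and `type` fields come from the output above.

```r
library(jsonlite)
library(tibble)

# Hypothetical miniature of the knowledge-graph JSON (illustration only)
mc3_json <- '{
  "nodes": [
    {"id": "Company A", "country": "ZH", "type": "Company"},
    {"id": "Jane Doe", "country": "ZH", "type": "Beneficial Owner"}
  ],
  "links": [
    {"source": "Company A", "target": "Jane Doe", "type": "Beneficial Owner"}
  ]
}'

# Parse the JSON and pull the links element out as a tibble
mc3_data  <- fromJSON(mc3_json)
mc3_links <- as_tibble(mc3_data$links)

print(mc3_links)
```

With the real data the same two calls would presumably be applied to the downloaded JSON file rather than to an inline string.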
Next, we will wrangle the data for edges:
distinct() is used to ensure that there will be no duplicated records.
mutate() and as.character() are used to convert the field data type from list to character.
group_by() and summarise() are used to count the number of unique links.
filter(source != target) is used to ensure that no record has the same source and target.
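The four steps above can be sketched as a single pipeline. This is an illustrative reconstruction on toy data, not necessarily the exact code used for the report: the toy `mc3_links` tibble is hypothetical, and the `weights` column name is taken from the summary statistics shown later.

```r
library(dplyr)
library(tibble)

# Hypothetical toy links stored as list columns, as in the glimpse() output
mc3_links <- tibble(
  source = list("Company A", "Company A", "Company B", "Jane Doe"),
  target = list("Jane Doe", "Jane Doe", "John Roe", "Jane Doe"),
  type   = list("Beneficial Owner", "Beneficial Owner",
                "Company Contacts", "Beneficial Owner")
)

mc3_edges <- mc3_links %>%
  distinct() %>%                              # drop duplicated records
  mutate(source = as.character(source),       # list -> character
         target = as.character(target),
         type   = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n(), .groups = "drop") %>%  # count links per pair and type
  filter(source != target)                    # drop self-links

print(mc3_edges)
```

Because duplicates are removed before counting, every remaining link has a weight of 1 in this sketch.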
We use as_tibble() to extract node data into a tibble.
We wrangle the node data:
mutate() and as.character() are used to convert the field data type from list to character.
To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.
select() is used to re-organise the order of the fields.
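A minimal sketch of the node wrangling described above, run on a hypothetical two-row tibble (`mc3_nodes_raw` is an assumed name, and the final column order is a guess based on the outputs shown elsewhere in this report). Entries without revenue become NA when coerced to numeric, which is why `revenue_omu` ends up with missing values.

```r
library(dplyr)
library(tibble)

# Hypothetical raw nodes stored as list columns (illustration only)
mc3_nodes_raw <- tibble(
  country          = list("ZH", "Oceanus"),
  id               = list("Company A", "Jane Doe"),
  product_services = list("Fish and seafood products", "Unknown"),
  revenue_omu      = list(16210.68, NULL),
  type             = list("Company", "Beneficial Owner")
)

mc3_nodes <- mc3_nodes_raw %>%
  mutate(country          = as.character(country),
         id               = as.character(id),
         product_services = as.character(product_services),
         # list -> character -> numeric; non-numeric entries become NA
         revenue_omu      = suppressWarnings(
           as.numeric(as.character(revenue_omu))),
         type             = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)  # assumed order

print(mc3_nodes)
```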
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there are no missing values in any of the fields.
datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.
We use ggplot to plot the frequency of different types of relationship in the edges. We have two types: beneficial owner and company contacts.
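A frequency plot of this kind might look like the sketch below. The toy `mc3_edges` tibble, the object name `p` and the axis labels are assumptions for illustration; only the two relationship types come from the text.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy edges with the two relationship types (illustration only)
mc3_edges <- tibble(
  type = c("Beneficial Owner", "Beneficial Owner",
           "Beneficial Owner", "Company Contacts")
)

# geom_bar() counts the rows in each category of `type`
p <- ggplot(mc3_edges, aes(x = type)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Relationship type", y = "Number of edges")

print(p)
```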
We take a look at the number of companies under each owner. Most owners have only 1 company. We will extract the graph for those owners with more than 3 companies under them. There are 67 such owners. We will look at the network of these owners.
# Count the companies held by each beneficial owner and plot the distribution
owner_companycnt <- mc3_edges %>%
  filter(type == 'Beneficial Owner') %>%
  group_by(target) %>%
  summarise(company_count = n()) %>%
  arrange(desc(company_count)) %>%
  ggplot(aes(x = company_count)) +
  geom_bar() +
  scale_x_continuous(breaks = c(1:10)) +
  theme_minimal() +
  ggtitle('Frequency Count by Number of Companies under each owner')
# Convert the static plot into an interactive plotly chart
ggplotly(owner_companycnt)
skim() of the skimr package is used to display the summary statistics of the mc3_nodes tibble data frame.
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
There are missing values in revenue_omu but none in the other variables: id, country, type and product_services.
datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.
Plot the frequency of each type of node using geom_bar().
Let’s take a look at the network of owners with a high number of companies under them. We want to check whether the companies under these owners are also co-owned by other owners. If an owner owns many companies that are not co-owned by others, these are likely shell companies.
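The code that follows refers to an `owners` table whose construction is not shown. One plausible way to derive it, assuming it holds the beneficial owners with more than 3 companies mentioned earlier, is sketched below on hypothetical toy edges; the actual definition used for the report may differ.

```r
library(dplyr)
library(tibble)

# Hypothetical toy edges: Owner A holds 4 companies, Owner B holds 2
mc3_edges <- tibble(
  source  = paste("Company", 1:6),
  target  = c(rep("Owner A", 4), rep("Owner B", 2)),
  type    = "Beneficial Owner",
  weights = 1L
)

# Assumed definition: owners linked to more than 3 companies
owners <- mc3_edges %>%
  filter(type == "Beneficial Owner") %>%
  group_by(target) %>%
  summarise(company_count = n()) %>%
  filter(company_count > 3)

print(owners)
```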
# Extract the companies held by the selected owners
cp_oh <- mc3_edges %>%
  filter(type == 'Beneficial Owner') %>%
  filter(target %in% owners$target) %>%
  pull(source)  # character vector, so that %in% below matches company names
# Extract edge information: links to the selected owners and to co-owners of their companies
oh_edges <- mc3_edges %>%
  filter(type == 'Beneficial Owner') %>%
  filter(target %in% owners$target | source %in% cp_oh)
# Extract node information
oh_id1 <- oh_edges %>%
  select(source) %>%
  rename(id = source) %>%
  mutate(type = 'Company')
oh_id2 <- oh_edges %>%
  select(target, type) %>%
  rename(id = target)
oh_nodes <- rbind(oh_id1, oh_id2) %>%
  distinct()
# Create owner_graph and attach node attributes and centrality measures
oh_graph <- as_tbl_graph(oh_edges, directed = FALSE)
oh_graph <- oh_graph %>%
  activate(nodes) %>%
  left_join(oh_nodes, by = c("name" = "id")) %>%
  mutate(betweenness_centrality = centrality_betweenness()) %>%
  mutate(degree_centrality = centrality_degree())
oh_graph
# A tbl_graph: 362 nodes and 313 edges
#
# An unrooted forest with 49 trees
#
# A tibble: 362 × 4
name type betweenness_centrality degree_centrality
<chr> <chr> <dbl> <dbl>
1 Acevedo, Dickson and Gonzalez Company 0 1
2 Adams Group Company 0 1
3 Adams-Pope Company 0 1
4 Adriatic Catch S.A. de C.V. Company 0 1
5 Albertine Rift NV Family Company 0 1
6 Alexander PLC Company 0 1
# ℹ 356 more rows
#
# A tibble: 313 × 4
from to type weights
<int> <int> <chr> <int>
1 1 296 Beneficial Owner 1
2 2 297 Beneficial Owner 1
3 3 298 Beneficial Owner 1
# ℹ 310 more rows

# Edge table with integer from/to indices
edges_df <- oh_graph %>%
  activate(edges) %>%
  as_tibble()
# Node table: visNetwork expects id, label and group columns
nodes_df <- oh_graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = name) %>%
  rename(group = type) %>%
  mutate(id = row_number())
visNetwork(nodes = nodes_df, edges = edges_df) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visLegend() %>%
  visEdges(arrows = "to",
           smooth = list(enabled = TRUE,
                         type = "curvedCW")) %>%
  visNodes(font = list(size = 30)) %>%
  visLayout(randomSeed = 123) %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE)
We first need to tokenize the words from the product and services descriptions and filter the companies from the node data.
# A tibble: 64,202 × 5
id country type revenue_omu word
<chr> <chr> <chr> <dbl> <chr>
1 Jones LLC ZH Company 310612303. automobiles
2 Coleman, Hall and Lopez ZH Company 162734684. passenger
3 Coleman, Hall and Lopez ZH Company 162734684. cars
4 Coleman, Hall and Lopez ZH Company 162734684. trucks
5 Coleman, Hall and Lopez ZH Company 162734684. vans
6 Coleman, Hall and Lopez ZH Company 162734684. and
7 Coleman, Hall and Lopez ZH Company 162734684. buses
8 Aqua Advancements Sashimi SE Express Oceanus Company 115004667. holding
9 Aqua Advancements Sashimi SE Express Oceanus Company 115004667. firm
10 Aqua Advancements Sashimi SE Express Oceanus Company 115004667. whose
# ℹ 64,192 more rows
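A tokenization step that yields a table of this shape can be sketched with `unnest_tokens()` from tidytext. The toy `mc3_nodes` tibble and the name `token_nodes` are hypothetical; note that `unnest_tokens()` lowercases the text and drops the original `product_services` column by default.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy nodes (illustration only)
mc3_nodes <- tibble(
  id               = c("Company A", "Jane Doe"),
  country          = c("ZH", "Oceanus"),
  type             = c("Company", "Beneficial Owner"),
  revenue_omu      = c(1000, NA),
  product_services = c("Canned tuna and frozen fish", "Unknown")
)

token_nodes <- mc3_nodes %>%
  filter(type == "Company") %>%             # keep company nodes only
  unnest_tokens(word, product_services)     # one row per word

print(token_nodes)
```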
We perform a frequency count for the words.

The bar chart reveals that the unique words include some that may not be useful for the analysis, for instance “a” and “to”. In the world of text mining these are called stop words. We want to remove them from the analysis because they are fillers used to compose a sentence.
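One common way to drop stop words, shown here as a sketch on hypothetical tokens, is an `anti_join()` against the `stop_words` table shipped with tidytext, followed by a frequency count. The names `token_nodes`, `stopwords_removed` and `word_counts` are assumptions for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tokens, including a few stop words (illustration only)
token_nodes <- tibble(
  id   = "Company A",
  word = c("canned", "tuna", "and", "frozen", "fish", "a", "to")
)

# Remove every token that appears in the tidytext stop word lexicons
stopwords_removed <- token_nodes %>%
  anti_join(stop_words, by = "word")

# Frequency count of the remaining words, most frequent first
word_counts <- stopwords_removed %>%
  count(word, sort = TRUE)

print(word_counts)
```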

Following the completion of cleaning and visualization, the subsequent stage involves conducting Latent Dirichlet Allocation (LDA). LDA is an iterative algorithm utilized to reveal topics by examining discrete word frequencies. The underlying notion behind LDA is that documents typically pertain to a limited number of topics, and these topics are generally based on a small set of words. However, prior to that, it is necessary to construct a Document Term Matrix. This matrix mathematically represents the occurrence frequency of terms within a collection of documents. In the document-term matrix, each row corresponds to a document in the collection, while each column corresponds to a term.
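A document-term matrix of the kind described above can be built from tidy tokens with `cast_dtm()` from tidytext (which relies on the tm package). The sketch below treats each company as one document; the toy tokens and the name `tokens` are hypothetical.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical cleaned tokens for two companies (illustration only)
tokens <- tibble(
  id   = c("Company A", "Company A", "Company A", "Company B", "Company B"),
  word = c("tuna", "fish", "fish", "steel", "pipes")
)

# Count each word per company, then cast to a DocumentTermMatrix:
# rows are documents (companies), columns are terms, cells are counts
dtm <- tokens %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n)

print(dtm)
```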
We use ldatuning to select the number of topics for the LDA model.
fit models... done.
calculate metrics:
Griffiths2004... done.
CaoJuan2009... done.
Arun2010... done.
Deveaud2014... done.
One straightforward method for analyzing metrics involves identifying extrema. For a more comprehensive understanding, please refer to the relevant papers:
Minimization:
Arun2010 [1]
CaoJuan2009 [2]
Maximization:
Deveaud2014 [3]
Griffiths2004 [4,5]
To facilitate easy analysis of the outcomes, the FindTopicsNumber_plot support function can be utilized.
We will check between 3 and 8 topics to see which number is optimal.
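The search over 3 to 8 topics could be run roughly as in the sketch below. It uses a 20-document slice of the `AssociatedPress` example matrix from topicmodels as a stand-in for the company document-term matrix, and the seed and `mc.cores` values are arbitrary assumptions rather than the settings used for this report.

```r
library(topicmodels)
library(ldatuning)

# Stand-in document-term matrix (the real analysis would use the company dtm)
data("AssociatedPress", package = "topicmodels")
dtm_demo <- AssociatedPress[1:20, ]

# Fit one LDA model per candidate number of topics and compute the four metrics
result <- FindTopicsNumber(
  dtm_demo,
  topics   = 3:8,
  metrics  = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method   = "Gibbs",
  control  = list(seed = 42),
  mc.cores = 1L,
  verbose  = TRUE
)

# Plot the metrics: look for minima of Arun2010 and CaoJuan2009,
# and maxima of Deveaud2014 and Griffiths2004
FindTopicsNumber_plot(result)
```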
# Fit a 3-topic LDA model and extract the per-topic word probabilities (beta)
lda_topics <- LDA(
  dtm,
  k = 3,
  method = "Gibbs",
  control = list(seed = 42)
) %>%
  tidy(matrix = "beta")
# Keep the 15 most probable words in each topic
word_probs <- lda_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
# One panel of horizontal bars per topic
ggplot(
  word_probs,
  aes(term2, beta, fill = as.factor(topic))
) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
Topic 1 involves companies with unknown products and services or miscellaneous categories. Topic 2 involves companies that deal with food. Topic 3 can be identified as industrial products.
We will run LDA again with 4 topics.
# Fit a 4-topic LDA model and extract the per-topic word probabilities (beta)
lda_topics <- LDA(
  dtm,
  k = 4,
  method = "Gibbs",
  control = list(seed = 42)
) %>%
  tidy(matrix = "beta")
# Keep the 15 most probable words in each topic
word_probs <- lda_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
# One panel of horizontal bars per topic
ggplot(
  word_probs,
  aes(term2, beta, fill = as.factor(topic))
) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
We can see that the combination of words in topic 4 is not coherent. The word “products” also appears in both topics 2 and 4. Therefore, for this analysis we will stick with only 3 topics, which we treat as industries.